The internet is teeming with bots: some are designed to help organizations streamline their workflows, while others cause real harm. Many people think of web crawlers, or “bots,” as friendly helpers that train AI and large language models (LLMs). While that’s true in some cases, there’s a darker side to web scraping: some bots harvest data without permission, potentially for malicious purposes. As artificial intelligence (AI) technologies continue to evolve, bots that collect training data have become more prevalent, creating real problems for website performance and security.
Here at Servebolt, we’re constantly striving to provide the best possible experience for our customers. That includes keeping our infrastructure running smoothly and efficiently. Today, we made a significant change by blocking two notorious bots, Bytespider and ClaudeBot, from both Servebolt CDN and Accelerated Domains.
Let’s explore why we made this decision and how it aligns with our commitment to providing excellent customer service.
Who are Bytespider and ClaudeBot?
Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It is reportedly used to collect training data for ByteDance’s LLMs, including those powering its ChatGPT competitor, Doubao.
ClaudeBot is the web crawler operated by Anthropic, the AI company behind the Claude family of large language models. It crawls the public web at scale to collect content, reportedly to train and improve Anthropic’s models.
The Darker Side of Web Crawling with Bytespider and ClaudeBot
Bots now handle important business functions such as customer service, financial advice, and sales. Designed to interact with people through voice or messaging interfaces, they have boosted productivity, streamlined operations, and improved revenue flows. Enthusiasm for bots across organizations has only grown, and since their inception they have become steadily more capable. However, they can also pose real threats to organizations and people. How, you may ask?
The number of bots and crawlers hitting websites every single day is staggering. While not all bots are inherently harmful, some behave in ways that make them very difficult to manage. Bytespider and ClaudeBot fall squarely into that category.
These bots often disregard robots.txt, the standard file webmasters use to tell crawlers which pages may be crawled and how often. They also route their traffic through well-known hosting providers, which lets them slip past common blocking methods such as firewalls and security plugins.
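For reference, opting out of these crawlers should only take a couple of lines in robots.txt, addressed to the user-agent tokens the bots identify themselves with. The problem described above is that this request is simply ignored:

```txt
# Ask these crawlers to stay away from the entire site
User-agent: Bytespider
Disallow: /

User-agent: ClaudeBot
Disallow: /
```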
By ignoring these directives, the bots overburden servers with millions of requests, hammering individual websites at rates of around 5 requests per second. They also employ tactics to evade rate limiting, which makes their traffic hard to detect and block, and challenging for hosting providers like Servebolt to manage.
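To make the evasion problem concrete, here is a minimal Python sketch of a naive per-IP rate limiter. This is purely illustrative, not Servebolt’s actual implementation, and it assumes the common evasion pattern of a crawler rotating through many cloud IP addresses, which keeps each individual address comfortably under any per-IP threshold.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # length of the sliding window
MAX_REQUESTS = 50     # requests allowed per IP within the window

# Timestamps of recent requests, tracked per client IP
recent_requests = defaultdict(deque)


def allow_request(client_ip, now=None):
    """Naive per-IP sliding-window rate limiter.

    A crawler that spreads its traffic across hundreds of cloud IP
    addresses stays under MAX_REQUESTS on every individual address,
    so this check never fires even though the aggregate load is huge.
    """
    now = time.time() if now is None else now
    window = recent_requests[client_ip]

    # Evict timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS:
        return False  # this IP exceeded the limit

    window.append(now)
    return True
```

Effective mitigation therefore has to key on user agents and aggregate behavior rather than simple per-IP counters, which is the approach described later in this post.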
Typically, the bots’ traffic originates from IP addresses geolocated in China. When we block their user agents, the origin geolocation shifts to Singapore, another common haven for malicious bots and bad actors.
Their crawling rates are extremely high.
- Bytespider: Generates roughly 3 million requests per day, causing significant strain on any infrastructure.
- ClaudeBot: Adds around 2 million requests per day, further increasing the load on servers.
Why Blocking Was Necessary
Our decision to block Bytespider and ClaudeBot was not taken lightly. We have been closely monitoring their activity for some time, and the numbers speak for themselves: approximately 3 million requests per day from Bytespider and around 2 million from ClaudeBot. While our infrastructure is perfectly capable of handling this load, it means increased power usage across every component involved, and therefore a larger carbon footprint, without adding any benefit.
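To put those figures in perspective: 3 million requests per day works out to roughly 35 requests every second, around the clock (3,000,000 ÷ 86,400 seconds ≈ 35), with ClaudeBot adding about 23 more per second on top of that.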
The decision was guided by the need to:
- Protect Customer Interests: Our customers depend on Servebolt to deliver fast and secure websites. By blocking bots that generate millions of unnecessary requests, we can provide better performance for legitimate users.
- Preserve Infrastructure: Excessive bot traffic consumes bandwidth and server resources that could be put to better use. By removing unnecessary requests, we can allocate resources more effectively, leading to improved server efficiency.
- Enhance Security: Bots like Bytespider and ClaudeBot routinely circumvent detection and mitigation measures. Blocking them helps safeguard our infrastructure and our customers’ sites against potential abuse.
Most importantly, these bots have failed to demonstrate any tangible value to Servebolt’s customers.
The Servebolt Approach to Bot Management
Servebolt has a multi-layered approach to bot management that ensures only valuable traffic reaches our customers’ websites. Here’s how we handle bots:
- Identification and Analysis: We closely monitor traffic patterns to identify bots and analyze their behavior. This allows us to differentiate between beneficial bots (like search engines) and those that are harmful or unnecessary. (A simplified sketch of this kind of analysis follows this list.)
- Blocking Strategies: Once identified, harmful bots are blocked using various methods, including, but not limited to, IP blacklisting, firewall rules, and behavioral analytics. We tailor our blocking strategies to target specific bots without affecting legitimate traffic.
- Continuous Monitoring: Bot behavior evolves rapidly. Servebolt regularly updates its bot management strategies to adapt to new challenges, ensuring that our customers remain protected against emerging threats.
- Transparent Communication: We believe in keeping our customers informed about decisions that impact their service. Blocking Bytespider and ClaudeBot is part of our ongoing commitment to maintaining a transparent relationship with our customers.
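To illustrate the identification step, here is a simplified Python sketch, not Servebolt’s production tooling. It assumes a combined-format access log at a hypothetical path (access.log) and a hand-maintained list of unwanted user-agent tokens, then reports the heaviest user agents and whether they would be blocked at the edge.

```python
import re
from collections import Counter

# User-agent substrings we consider unwanted; a hypothetical starting list
BLOCKED_AGENTS = ("Bytespider", "ClaudeBot")

# Loose pattern for a combined-format access log line:
#   ... "GET /path HTTP/1.1" 200 1234 "referer" "user agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"')


def summarize(log_path):
    """Count requests per user agent in an access log."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LOG_LINE.search(line)
            if match:
                counts[match.group("agent")] += 1
    return counts


def is_blocked(user_agent):
    """Edge decision: block known-bad crawler tokens outright."""
    return any(token in user_agent for token in BLOCKED_AGENTS)


if __name__ == "__main__":
    totals = summarize("access.log")  # hypothetical log path
    for agent, hits in totals.most_common(10):
        flag = "BLOCK" if is_blocked(agent) else "allow"
        print(f"{hits:>10}  {flag}  {agent}")
```

In practice this kind of per-agent tally is only the starting point; it is combined with behavioral signals and firewall rules so that legitimate crawlers and real visitors are never caught in the net.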
The Broader Implications
The issue of bot traffic is not limited to Servebolt alone. It’s a challenge faced by website owners globally. While bots play a significant role in powering AI and machine learning (ML) systems, not all bots add value to the websites they visit. Some, like Bytespider and ClaudeBot, prioritize their data-gathering objectives over the well-being of the sites they crawl.
In an era where data privacy and security are paramount, it’s crucial for companies to prioritize ethical bot management. But bot management isn’t just about detecting threats; it’s also about having a tailored response strategy.
Servebolt remains committed to this principle to safeguard your business from malicious actors, ensuring that our services are not exploited by bad bots.
In light of this, we encourage a broader conversation within the tech community about responsible bot usage and the ethical considerations surrounding data scraping and web crawling. By fostering dialogue and collaboration, we can work towards a more transparent and equitable internet ecosystem for all.
Conclusion
Servebolt’s decision to block Bytespider and ClaudeBot underlines our dedication to providing high-performing, secure, and customer-focused service. By understanding these bots’ behavior and acting accordingly, we can better protect our customers from unnecessary server strain, security risks, and additional costs.